## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
## [1] 651 32
## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58300
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:651 Length:651
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## [1] 1970
## [1] 2014
The “movies” dataset has 651 entries and 32 variables.
Generalizability
According to IMDB's official website, 9454 movies were released between 1970 and 2014. Our current dataset includes 651 entries, which is well under 10% of that total, so the sampled movies can be treated as independent. Since the dataset was obtained by random sampling, the findings should generalize to movies released in this period.
Causality
Causal conclusions can be drawn only from an experiment with random assignment, not from observational data. Since the "movies" dataset is observational, we can speak only of association, not causation.
How do factors like genre, runtime, best picture nomination, best actor or actress win affect audience score apart from the regular factors like imdb rating, critics rating and year of release?
Why this question interests me:
Some people are not fans of certain genres, and there is a chance they would rate a movie low even if it is a great movie within that genre. In my opinion, most movies nominated for Best Picture at the Oscars perform well with the audience. The popularity of the actors and actresses also affects how the audience rates a movie, and the year and time of release play an important role too. Similarly, some people rely on the critics rating and IMDB rating before watching a movie. These are some reasons I'm interested in knowing how the above-stated factors affect audience score.
We notice that although there are 32 variables present in the "movies" dataset, not every variable will aid us in answering the research question. Instead, we will create a new dataset called "movies_newset" from the "movies" dataset, keeping only the variables that are useful in answering the research question.
movies_newset <- movies %>%
select(title, title_type, genre, runtime, mpaa_rating, thtr_rel_year, thtr_rel_month, dvd_rel_year,imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box) #select the variables that are useful to answer the research question
movies_newset <- na.exclude(movies_newset) #exclude NA's from the dataset
dim(movies_newset) #gives the dimensions of the dataset
## [1] 642 20
## Classes 'tbl_df', 'tbl' and 'data.frame': 642 obs. of 20 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "na.action")= 'exclude' Named int 100 184 261 334 345 375 377 437 451
## ..- attr(*, "names")= chr "100" "184" "261" "334" ...
Let's look at how some categorical variables relate to audience score. This can be understood using density plots: a density plot shows how audience scores are distributed within each level of a categorical variable, and therefore whether that variable is informative about higher or lower scores. We shall plot the following categorical variables against audience score; how well each one separates the score distributions can then be used as a factor while selecting the model.
ggplot(movies_newset, aes(audience_score, fill = title_type))+
  geom_density(alpha = 0.3) #performance of title_type against audience score

ggplot(movies_newset, aes(audience_score, fill = genre))+
  geom_density(alpha = 0.2) #performance of genre against audience score

ggplot(movies_newset, aes(audience_score, fill = mpaa_rating))+
  geom_density(alpha = 0.3) #performance of mpaa_rating against audience score

ggplot(movies_newset, aes(audience_score, fill = critics_rating))+
  geom_density(alpha = 0.3) #performance of critics_rating against audience score

ggplot(movies_newset, aes(audience_score, fill = best_pic_nom))+
  geom_density(alpha = 0.3) #performance of best_pic_nom against audience score

ggplot(movies_newset, aes(audience_score, fill = best_dir_win))+
  geom_density(alpha = 0.3) #performance of best_dir_win against audience score

ggplot(movies_newset, aes(audience_score, fill = best_actor_win))+
  geom_density(alpha = 0.3) #performance of best_actor_win against audience score

ggplot(movies_newset, aes(audience_score, fill = best_actress_win))+
  geom_density(alpha = 0.3) #performance of best_actress_win against audience score

ggplot(movies_newset, aes(audience_score, fill = top200_box))+
  geom_density(alpha = 0.3) #performance of top200_box against audience score

From the above plots, title_type, genre, mpaa_rating, critics_rating, best_pic_nom, and top200_box give good insight into the audience score. We can use these categorical variables to build the model.
Let's look at how some numerical variables relate to the response variable audience_score. Scatterplots will give us insight into whether each relationship is approximately linear.
ggplot(movies_newset, aes(x = imdb_rating, y = audience_score))+
  geom_jitter()+
  geom_smooth(se = FALSE, method = "lm") #plot of imdb_rating vs audience_score

ggplot(movies_newset, aes(x = critics_score, y = audience_score))+
  geom_jitter()+
  geom_smooth(se = FALSE, method = "lm") #plot of critics_score vs audience_score

ggplot(movies_newset, aes(x = thtr_rel_year, y = audience_score))+
  geom_jitter()+
  geom_smooth(se = FALSE, method = "lm") #plot of thtr_rel_year vs audience_score

ggplot(movies_newset, aes(x = imdb_num_votes, y = audience_score))+
  geom_jitter()+
  geom_smooth(se = FALSE, method = "lm") #plot of imdb_num_votes vs audience_score

ggplot(movies_newset, aes(x = dvd_rel_year, y = audience_score))+
  geom_jitter()+
  geom_smooth(se = FALSE, method = "lm") #plot of dvd_rel_year vs audience_score

ggplot(movies_newset, aes(x = thtr_rel_month, y = audience_score))+
  geom_jitter()+
  geom_smooth(se = FALSE, method = "lm") #plot of thtr_rel_month vs audience_score

movies_newset %>%
  summarise(cor(audience_score, imdb_rating)) #correlation between audience_score and imdb_rating
## # A tibble: 1 x 1
##   `cor(audience_score, imdb_rating)`
##                                <dbl>
## 1                              0.863

movies_newset %>%
  summarise(cor(audience_score, critics_score)) #correlation between audience_score and critics_score
## # A tibble: 1 x 1
##   `cor(audience_score, critics_score)`
##                                 <dbl>
## 1                               0.700

movies_newset %>%
  summarise(cor(audience_score, thtr_rel_year)) #correlation between audience_score and thtr_rel_year
## # A tibble: 1 x 1
##   `cor(audience_score, thtr_rel_year)`
##                                 <dbl>
## 1                              -0.0612

movies_newset %>%
  summarise(cor(audience_score, imdb_num_votes)) #correlation between audience_score and imdb_num_votes
## # A tibble: 1 x 1
##   `cor(audience_score, imdb_num_votes)`
##                                  <dbl>
## 1                                0.292

movies_newset %>%
  summarise(cor(audience_score, dvd_rel_year)) #correlation between audience_score and dvd_rel_year
## # A tibble: 1 x 1
##   `cor(audience_score, dvd_rel_year)`
##                                <dbl>
## 1                              -0.0638

movies_newset %>%
  summarise(cor(audience_score, thtr_rel_month)) #correlation between audience_score and thtr_rel_month
## # A tibble: 1 x 1
##   `cor(audience_score, thtr_rel_month)`
##                                  <dbl>
## 1                                0.0399
We understand the following from the above plots and correlations: imdb_rating (r = 0.863) and critics_score (r = 0.700) have strong positive linear relationships with audience_score; imdb_num_votes has only a weak positive relationship (r = 0.292); and thtr_rel_year, dvd_rel_year, and thtr_rel_month show essentially no linear relationship with audience_score.
Test for collinearity
Let’s check for collinearity, if there exists any, for all the numerical variables in the movies_newset dataset.
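One way to perform this check (a minimal sketch, assuming the movies_newset data frame built above) is to compute the pairwise correlation matrix of the numeric variables; the GGally package's ggpairs() function can draw the corresponding pairs plot.

```r
# Pairwise correlations among the numeric variables in movies_newset;
# off-diagonal values near +/-1 indicate collinearity between predictors.
num_vars <- movies_newset[, c("runtime", "thtr_rel_year", "thtr_rel_month",
                              "dvd_rel_year", "imdb_rating", "imdb_num_votes",
                              "critics_score")]
round(cor(num_vars), 2)
```

Any pair with a high off-diagonal correlation is a candidate for dropping one of the two variables from the model.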
From the above plot, it is evident that there is a high correlation between imdb_rating and critics_score, with a value of 0.762. Similarly, dvd_rel_year and thtr_rel_year have a correlation of 0.66.
The inference is that including both imdb_rating and critics_score, or both dvd_rel_year and thtr_rel_year, adds little value to the model while introducing collinearity, so at most one variable from each pair should be used.
Based on the exploratory data analysis conducted, the audience score can be predicted using the following 10 variables:
audience_score ~ title_type + runtime + genre + imdb_rating + imdb_num_votes + critics_rating + thtr_rel_year + best_pic_nom + best_dir_win + top200_box
audience_score_model <- lm(audience_score ~ title_type + runtime + genre + critics_rating + best_pic_nom + imdb_rating + imdb_num_votes + thtr_rel_year + best_dir_win + top200_box, data = movies_newset) #provides a linear model for audience_score using the above variables
summary(audience_score_model)
##
## Call:
## lm(formula = audience_score ~ title_type + runtime + genre +
## critics_rating + best_pic_nom + imdb_rating + imdb_num_votes +
## thtr_rel_year + best_dir_win + top200_box, data = movies_newset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.462 -5.923 0.122 5.622 49.257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.631e+02 8.008e+01 2.036 0.042129 *
## title_typeFeature Film 7.494e-01 3.621e+00 0.207 0.836109
## title_typeTV Movie 2.116e+00 5.726e+00 0.370 0.711846
## runtime -5.491e-02 2.348e-02 -2.339 0.019663 *
## genreAnimation 8.646e+00 3.517e+00 2.458 0.014228 *
## genreArt House & International 3.963e-01 3.038e+00 0.130 0.896256
## genreComedy 2.162e+00 1.637e+00 1.320 0.187162
## genreDocumentary 3.228e+00 3.889e+00 0.830 0.406878
## genreDrama 5.330e-01 1.429e+00 0.373 0.709273
## genreHorror -5.245e+00 2.404e+00 -2.182 0.029510 *
## genreMusical & Performing Arts 6.272e+00 3.365e+00 1.864 0.062804 .
## genreMystery & Suspense -5.366e+00 1.806e+00 -2.971 0.003086 **
## genreOther 1.040e+00 2.793e+00 0.372 0.709847
## genreScience Fiction & Fantasy -1.987e+00 3.685e+00 -0.539 0.589823
## critics_ratingFresh -2.364e+00 1.222e+00 -1.935 0.053425 .
## critics_ratingRotten -4.955e+00 1.321e+00 -3.750 0.000193 ***
## best_pic_nomyes 2.818e+00 2.331e+00 1.209 0.227032
## imdb_rating 1.487e+01 5.372e-01 27.682 < 2e-16 ***
## imdb_num_votes 4.435e-06 4.585e-06 0.967 0.333736
## thtr_rel_year -9.495e-02 3.952e-02 -2.403 0.016562 *
## best_dir_winyes -1.594e+00 1.626e+00 -0.980 0.327316
## top200_boxyes 5.159e-01 2.724e+00 0.189 0.849841
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.762 on 620 degrees of freedom
## Multiple R-squared: 0.7729, Adjusted R-squared: 0.7652
## F-statistic: 100.5 on 21 and 620 DF, p-value: < 2.2e-16
The adjusted R-squared value for the current model is 0.7652 and the p-value is < 2.2e-16.
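Before automating the search, it may help to see what one round of manual backward elimination by adjusted R-squared looks like: refit the model with each term dropped in turn, and keep the drop that raises adjusted R-squared the most. A sketch, assuming the audience_score_model fit above:

```r
# One round of manual backward elimination by adjusted R-squared:
# refit the model with each term removed and record adjusted R-squared.
full_adj_r2 <- summary(audience_score_model)$adj.r.squared
model_terms <- attr(terms(audience_score_model), "term.labels")
adj_r2_without <- sapply(model_terms, function(v) {
  reduced <- update(audience_score_model, as.formula(paste(". ~ . -", v)))
  summary(reduced)$adj.r.squared
})
# If max(adj_r2_without) exceeds full_adj_r2, drop that term and repeat.
sort(adj_r2_without, decreasing = TRUE)
```

Repeating this until no drop improves adjusted R-squared yields the final model, which is exactly the tedium the automated approach avoids.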
To select the model that best predicts the response variable audience_score, we will use the backward elimination method with adjusted R-squared as the criterion. The main reason for using adjusted R-squared for model selection is that this criterion favors reliable predictions.
Carrying out backward elimination manually would be time-consuming. Instead, we will use the step() function, which performs AIC-based backward elimination, a closely related criterion, and saves a lot of time during model selection.
best_model <- step(audience_score_model, direction = "backward", trace = FALSE) #AIC based backward elimination method similar to Adj R-squared backward elimination method
summary(best_model)
##
## Call:
## lm(formula = audience_score ~ runtime + genre + critics_rating +
## imdb_rating + thtr_rel_year, data = movies_newset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.281 -6.029 0.190 5.483 49.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 144.78196 76.24389 1.899 0.05803 .
## runtime -0.04714 0.02197 -2.145 0.03233 *
## genreAnimation 8.58994 3.50461 2.451 0.01452 *
## genreArt House & International -0.03871 2.99099 -0.013 0.98968
## genreComedy 2.10082 1.62211 1.295 0.19576
## genreDocumentary 1.92799 2.04548 0.943 0.34627
## genreDrama 0.36446 1.39008 0.262 0.79326
## genreHorror -5.31573 2.39106 -2.223 0.02656 *
## genreMusical & Performing Arts 5.33036 3.12207 1.707 0.08826 .
## genreMystery & Suspense -5.47038 1.79163 -3.053 0.00236 **
## genreOther 1.52787 2.76016 0.554 0.58009
## genreScience Fiction & Fantasy -1.97938 3.67608 -0.538 0.59046
## critics_ratingFresh -2.97365 1.15186 -2.582 0.01006 *
## critics_ratingRotten -5.41478 1.27014 -4.263 2.33e-05 ***
## imdb_rating 15.04544 0.51198 29.387 < 2e-16 ***
## thtr_rel_year -0.08598 0.03782 -2.273 0.02335 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.745 on 626 degrees of freedom
## Multiple R-squared: 0.7715, Adjusted R-squared: 0.766
## F-statistic: 140.9 on 15 and 626 DF, p-value: < 2.2e-16
The simplified model after backward elimination has 5 variables, i.e., runtime, genre, imdb_rating, critics_rating, and thtr_rel_year. Runtime, imdb_rating, thtr_rel_year, and critics_rating are the significant variables in this model. The adjusted R-squared value is 0.766 and the p-value is < 2.2e-16.
The best_model becomes our parsimonious model here. According to the best_model: -
audience_score ~ runtime + genre + critics_rating + imdb_rating + thtr_rel_year
Model diagnostics for the Multiple Linear Regression Model
The following conditions need to be satisfied to confirm the model is a reliable model.
1. Linear Relationship between numerical variable and the response variable.
This can be checked by plotting each numerical explanatory variable against the residuals.
#Scatterplot for residuals vs runtime
ggplot(data = movies_newset, aes(x = runtime, y = best_model$residuals))+
  geom_point(color = "orange")+
  geom_hline(yintercept = 0, linetype = "dashed")+
  xlab("Runtime")+
  ylab("Residuals")

#Scatterplot for residuals vs imdb_rating
ggplot(data = movies_newset, aes(x = imdb_rating, y = best_model$residuals))+
  geom_point(color = "blue")+
  geom_hline(yintercept = 0, linetype = "dashed")+
  xlab("imdb_rating")+
  ylab("Residuals")

#Scatterplot for residuals vs thtr_rel_year
ggplot(data = movies_newset, aes(x = thtr_rel_year, y = best_model$residuals))+
  geom_point(color = "aquamarine4")+
  geom_hline(yintercept = 0, linetype = "dashed")+
  xlab("thtr_rel_year")+
  ylab("Residuals")

From the above three plots we notice that, for all three variables, the residuals are scattered randomly around 0. This indicates a linear relationship between each of the numerical explanatory variables and the response variable.
2. Nearly normal residuals with mean 0
This can be checked by plotting a histogram and a normal probability plot. Ideally, for a linear regression, the histogram should be centered around 0.
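A minimal sketch of these two checks, assuming the best_model fit above (using base R graphics rather than ggplot2):

```r
# Histogram of the residuals; for a reliable linear model it should
# be unimodal and centered around 0.
hist(best_model$residuals, breaks = 30,
     main = "Histogram of residuals", xlab = "Residuals")

# Normal probability (Q-Q) plot; points falling along the reference
# line suggest nearly normal residuals.
qqnorm(best_model$residuals)
qqline(best_model$residuals)
```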
From the above plots, we observe that the residuals are indeed centered at 0 and nearly normal, with a slight right skew; in the normal probability plot most of the residuals fall along the line, so this condition is reasonably satisfied.
3. Constant variability of residuals
The constant variability of residuals condition can be checked by plotting the predicted values against the residuals. For a linear regression, the plot should be scattered randomly around 0 without having any shape.
#scatterplot of residuals vs predicted values
ggplot(data = best_model, aes(x = .fitted, y = .resid))+
  geom_point(col = "brown")+
  geom_hline(yintercept = 0, linetype = "dashed")+
  xlab("Predicted Value")+
  ylab("Residual Values")

The above plot shows that the scatter is random around 0, with no fan shape. This supports constant variability of the residuals.
4. Independent Residuals
Plotting the residuals in the order of data collection lets us check for independence of the residuals and identify any time-series structure.
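A minimal sketch of this check, assuming best_model from above: plot the residuals against their observation order; a patternless band around 0 supports independence.

```r
# Residuals plotted in the order the observations appear in the data;
# any trend or cyclic pattern would suggest a time-series structure.
plot(best_model$residuals, col = "steelblue",
     xlab = "Order of observation", ylab = "Residuals")
abline(h = 0, lty = 2)
```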
From the above plot, it is clear there is no time series present in the dataset. The residuals are scattered randomly around 0 indicating the independence of residuals.
To check how our final model, best_model, performs, let us try it on a couple of examples. First, we predict the audience_score for one of the biggest blockbusters of 2018, "Avengers: Infinity War".
#audience score prediction for Avengers: Infinity War
avengers_infinity_war <- data.frame(runtime = 149, genre = "Science Fiction & Fantasy", critics_rating = "Certified Fresh", imdb_rating = 8.5, thtr_rel_year = 2018)
predict(best_model, avengers_infinity_war)
##        1
## 90.15883
Our model predicts an audience_score of 90.16%, while the actual audience_score on Rotten Tomatoes is 91%, so our model's estimate is very close to the actual value. To check whether the actual audience_score falls within the interval, let us look at the 95% prediction interval.
#95% prediction interval for Avengers: Infinity War using the best-fit model
predict(best_model, avengers_infinity_war, interval = "prediction", level = 0.95)
##        fit     lwr     upr
## 1 90.15883 69.6007 110.717
The actual value, 91%, falls well within the 95% prediction interval of (69.60, 110.72). Note that the upper bound exceeds 100, the maximum possible audience_score: a linear model does not constrain its predictions to the valid score range.
Let's look at another movie from a different genre, with a less impressive imdb_rating and critics_rating, to understand the model's predictions. We'll predict the audience_score of the Bond movie "Live and Let Die".
#audience score prediction for Live and Let Die
live_and_let_die <- data.frame(runtime = 121, genre = "Mystery & Suspense", critics_rating = "Fresh", imdb_rating = 6.8, thtr_rel_year = 1973)
predict(best_model, live_and_let_die)
##        1
## 63.30584
The audience_score predicted by our model is 63.3%, while the actual audience_score is 65%, so the prediction is again very close to the actual score.
#95% prediction interval for Live and Let Die using the best-fit model
predict(best_model, live_and_let_die, interval = "prediction", level = 0.95)
##       fit      lwr      upr
## 1 63.30584 43.90693 82.70475
The actual audience_score, 65%, falls within the 95% prediction interval of (43.91, 82.70).
With these two examples, it is evident that best_model gives good predictions with low error rates (0.92% in the case of Avengers: Infinity War and 2.61% in the case of Live and Let Die). In both cases the actual audience_score also falls within the 95% prediction interval. The references are in the "Part 7: References" section.
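The error rates quoted above can be reproduced directly from the model's predictions and the actual Rotten Tomatoes scores:

```r
# Relative error in percent: |predicted - actual| / actual * 100
predicted <- c(90.15883, 63.30584)  # model predictions from above
actual    <- c(91, 65)              # actual Rotten Tomatoes audience scores
round(100 * abs(predicted - actual) / actual, 2)
## [1] 0.92 2.61
```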
Based on our research, we conclude that runtime and genre were the two variables that best helped predict the audience score, apart from the regular variables like IMDB rating, critics rating, and year of release. Contrary to my assumptions, a best actor or actress win did not influence the audience score as much as genre or runtime, as can be observed in the density plots of audience score by best actor/actress win.
The parsimonious model obtained does a decent job of predicting the audience score, coming close to the actual value. The error rate could be reduced further by including other variables such as box office collections, popularity of the franchise (e.g., the Marvel Cinematic Universe), etc.